Abstract

To address the contextual bandit problem, we propose an online random forest algorithm. The analysis of the proposed algorithm is based on the sample complexity needed to find the optimal decision stump. Then, the decision stumps are recursively stacked in a random collection of decision trees, Bandit Forest. We show that the proposed algorithm is near optimal. The dependence of the sample complexity upon the number of contextual variables is logarithmic. The computational cost of the proposed algorithm with respect to the time horizon is linear. These analytical results allow the proposed algorithm to be efficient in real applications, where the number of events to process is huge, and where we expect that some contextual variables, chosen from a large set, have potentially non-linear dependencies with the rewards. In the experiments done to illustrate the theoretical analysis, Bandit Forest obtains promising results in comparison with state-of-the-art algorithms.


1 Introduction 

By interacting with streams of events, machine learning algorithms are used for instance to optimize the choice of ads on a website, to choose the best human machine interface, to recommend products on a web shop, to ensure self-care of set top boxes, to assign the best wireless network to mobile phones. With the now rising internet of things, the number of decisions (or actions) to be taken by more and more autonomous devices further increases. In order to control the cost and the potential risk of deploying a lot of machine learning algorithms in the long run, we need scalable algorithms which provide strong theoretical guarantees.

Most of these applications require taking and optimizing decisions with a partial feedback. Only the reward of the chosen decision is known. Does the user click on the proposed ad? The most relevant ad is never revealed. Only the click on the proposed ad is known. This well known problem is called multi-armed bandits (MAB). In its most basic formulation, it can be stated as follows: there are K decisions, each having an unknown distribution of bounded rewards. At each step, one has to choose a decision and receives a reward. The performance of a MAB algorithm is assessed in terms of regret (or opportunity loss) with regard to the unknown optimal decision. Optimal solutions have been proposed to solve this problem using a stochastic formulation in Auer et al (2002), using a Bayesian formulation in Kaufman et al (2012), or using an adversarial formulation in Auer et al (2002). While these approaches focus on the minimization of the expected regret, the PAC setting (see Valiant (1984)), or the $(\epsilon, \delta)$-best-arm identification, focuses on the sample complexity (i.e. the number of time steps) needed to find an $\epsilon$-approximation of the best arm with a failure probability of $\delta$. This formulation has been studied for the MAB problem in Even-Dar et al (2002); Bubeck et al (2009), for the dueling bandit problem in Urvoy et al (2013), and for the linear bandit problem in Soare et al (2014).

Several variations of the MAB problem have been introduced in order to fit practical constraints coming from ad serving or marketing optimization. These variations include for instance the death and birth of arms in Chakrabarti et al (2008), the availability of actions in Kleinberg et al (2008), or a drawing without replacement in Feraud and Urvoy (2012, 2013). But more importantly, in most of these applications, rich contextual information is also available. For instance, in ad-serving optimization we know the web page, the position in the web page, or the profile of the user. This contextual information must be exploited in order to decide which ad is the most relevant to display. Following Langford and Zhang (2007); Dudik et al (2011), in order to analyze the proposed algorithms, we formalize below the contextual bandit problem (see Algorithm 1).

Let $x_t \in \{0,1\}^M$ be a vector of binary values describing the environment at time $t$. Let $y_t \in [0,1]^K$ be a vector of bounded rewards at time $t$, and $y_{k_t}(t)$ be the reward of the action (decision) $k_t$ at time $t$. Let $D_{x,y}$ be a joint distribution on $(x, y)$. Let $\mathcal{A}$ be a set of $K$ actions. Let $\pi : \{0,1\}^M \to \mathcal{A}$ be a policy, and $\Pi$ the set of policies.


Algorithm 1 The contextual bandit problem

repeat
    $(x_t, y_t)$ is drawn according to $D_{x,y}$
    $x_t$ is revealed to the player
    The player chooses an action $k_t = \pi_t(x_t)$
    The reward $y_{k_t}(t)$ is revealed
    The player updates its policy $\pi_t$
    $t = t + 1$
until $t = T$
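The interaction protocol of Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names (`run_contextual_bandit`, `draw`, `policy`, `update`) and the toy environment are assumptions made for the example, not part of the paper.

```python
import random

def run_contextual_bandit(draw, policy, update, horizon):
    """Play the protocol of Algorithm 1 for `horizon` rounds; return total reward."""
    total = 0.0
    for _ in range(horizon):
        x, y = draw()          # (x_t, y_t) is drawn from the joint distribution
        k = policy(x)          # the player only sees the context x_t
        reward = y[k]          # partial feedback: only y_{k_t}(t) is revealed
        update(x, k, reward)   # the player updates its policy
        total += reward
    return total

# Hypothetical toy environment: one binary variable, two actions.
# Action 1 pays when x[0] == 1, action 0 pays when x[0] == 0.
def draw():
    x = [random.randint(0, 1)]
    return x, [1 - x[0], x[0]]

counts = {}  # (context value, action) -> [number of plays, sum of rewards]

def update(x, k, r):
    c = counts.setdefault((x[0], k), [0, 0.0])
    c[0] += 1
    c[1] += r

def policy(x):
    # Try each action once per context value, then act greedily.
    for k in (0, 1):
        if (x[0], k) not in counts:
            return k
    return max((0, 1), key=lambda k: counts[(x[0], k)][1] / counts[(x[0], k)][0])
```

In this toy setting the rewards are deterministic, so the greedy player loses at most one reward per context value while it tries the wrong action.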


Notice that this setting can easily be extended to categorical variables through a binary encoding. The optimal policy $\pi^*$ maximizes the expected gain:

$\pi^* = \arg\max_{\pi \in \Pi} E_{D_{x,y}}\left[y_{\pi(x_t)}(t)\right]$

Let $k_t^* = \pi^*(x_t)$ be the action chosen by the optimal policy at time $t$. The performance of the player policy is assessed in terms of expectation of the accumulated regret against the optimal policy with respect to $D_{x,y}$:

$E_{D_{x,y}}[R(T)] = \sum_{t=1}^{T} E_{D_{x,y}}\left[y_{k_t^*}(t) - y_{k_t}(t)\right]$

where $R(T)$ is the accumulated regret at time horizon $T$.


2 Our contribution


Decision trees work by partitioning the input space into hyper-boxes. They can be seen as a combination of rules, where only one rule is selected for a given input vector. Finding the optimal tree structure (i.e. the optimal combination of rules) is NP-hard. For this reason, a greedy approach is used to build the decision trees offline (see Breiman et al (1984)) or online (see Domingos and Hulten (2000)). The key concept behind the greedy decision trees is the decision stump. While Monte-Carlo Tree Search approaches (see Kocsis and Szepesvari (2006)) focus on the regret minimization (i.e. maximization of gains), the analysis of the proposed algorithms is based on the sample complexity needed to find the optimal decision stump with high probability. This formalization facilitates the analysis of decision tree algorithms. Indeed, to build a decision tree under limited resources, one needs to eliminate most of the possible branches. The sample complexity is the stopping criterion we need to stop the exploration of unpromising branches. In Bandit Forest, the decision stumps are recursively stacked in a random collection of $L$ decision trees of maximum depth $D$. We show that the Bandit Forest algorithm is near optimal with respect to a strong reference: a random forest built knowing the joint distribution of the contexts and rewards.

In comparison to algorithms based on a search of the best policy from a finite set of policies (see Auer et al (2002); Dudik et al (2011); Agrawal et al (2014)), our approach has several advantages. First, we take advantage of the fact that we know the structure of the set of policies to obtain a linear computational cost with respect to the time horizon $T$. Second, as our approach does not need to store a weight for each possible tree, we can use deeper rules without exceeding the memory resources. In comparison to the approaches based on a linear model (see LinUCB in Li et al (2010)), our approach also has several advantages. First, it is better suited for the case where the dependence between the rewards and the contexts is not linear. Second, the dependence of the regret bounds of the proposed algorithms on the number of contextual variables is in the order of $O(\log M)$, while the one of linear bandits is in $O(\sqrt{M})$*. Third, its computational cost with respect to the time horizon, in $O(LMDT)$, allows processing large sets of variables, while linear bandits are penalized by the update of an $M \times M$ matrix at each update, which leads to a computational cost in $O(KM^2T)$.

*In fact the dependence on the number of contextual variables of the gap dependent regret bound is in $O(M^2)$ (see Theorem 5 in Abbasi-Yadkori and Szepesvari (2011)).


3 The decision stump

In this section, we consider a model which consists of a decision stump based on the values of a single contextual variable, chosen from a set of $M$ binary variables.


3.1 A gentle start


In order to explain the principle and to introduce the notations, before describing the decision stump used to build Bandit Forest, we illustrate our approach on a toy problem. Let $k_1$ and $k_2$ be two actions. Let $x_{i_1}$ and $x_{i_2}$ be two binary random variables, describing the context. In this illustrative example, the contextual variables are assumed to be independent. Relevant probabilities and rewards are summarized in Table 1. $\mu_{k_2}^{i_1}|v$ denotes the conditional expected reward of the action $k_2$ given $x_{i_1} = v$, and $P(x_{i_1} = v)$ denotes the probability to observe $x_{i_1} = v$.

Table 1: The mean reward of actions $k_1$ and $k_2$ knowing each context value, and the probability to observe this context.

                        v_0     v_1
$\mu_{k_1}^{i_1}|v$     0       1
$\mu_{k_2}^{i_1}|v$     3/5     1/6
$P(x_{i_1} = v)$        5/8     3/8
$\mu_{k_1}^{i_2}|v$     1/4     3/4
$\mu_{k_2}^{i_2}|v$     9/24    5/8
$P(x_{i_2} = v)$        3/4     1/4

We compare the strategies of different players. Player 1 only uses uncontextual expected rewards, while Player 2 uses the knowledge of $x_{i_1}$ to decide. According to Table 1, the best strategy for Player 1 is to always choose action $k_2$. His expected reward will be $\mu_{k_2} = 7/16$. Player 2 is able to adapt his strategy to the context: his best strategy is to choose $k_2$ when $x_{i_1} = v_0$ and $k_1$ when $x_{i_1} = v_1$.


According to Table 1, his expected reward will be:

$\mu^{i_1} = P(x_{i_1} = v_0) \cdot \mu_{k_2}^{i_1}|v_0 + P(x_{i_1} = v_1) \cdot \mu_{k_1}^{i_1}|v_1 = \mu_{k_2,v_0}^{i_1} + \mu_{k_1,v_1}^{i_1} = 3/4$,

where $\mu_{k_2,v_0}^{i_1}$ and $\mu_{k_1,v_1}^{i_1}$ denote respectively the expected reward of the action $k_2$ and $x_{i_1} = v_0$, and the expected reward of the action $k_1$ and $x_{i_1} = v_1$. Whatever the expected rewards of each value, the player who uses this knowledge is the expected winner. Indeed, we have:

$\mu^{i} = \max_k \mu_{k,v_0}^{i} + \max_k \mu_{k,v_1}^{i} \geq \max_k \mu_k$

Now, if a third player uses the knowledge of the contextual variable $x_{i_2}$, his expected reward will be:

$\mu^{i_2} = \mu_{k_2,v_0}^{i_2} + \mu_{k_1,v_1}^{i_2} = 15/32$

Player 2 remains the expected winner, and therefore $x_{i_1}$ is the best contextual variable to decide between $k_1$ and $k_2$.

The best contextual variable is the one which maximizes the expected reward of the best actions for each of its values. We use this principle to build a reward-maximizing decision stump.

3.2 Variable selection

Let $\mathcal{V}$ be the set of variables, and $\mathcal{A}$ be the set of actions. Let $\mu_k^i|v = E_{D_y}[y_k \cdot \mathbb{1}_{x_i=v}]$ be the expected reward of the action $k$ conditioned to the observation of the value $v$ of the variable $x_i$. Let $\mu_{k,v}^i = P(v) \cdot \mu_k^i|v = E_{D_{x,y}}[y_k \cdot \mathbb{1}_{x_i=v}]$ be the expected reward of the action $k$ and the value $v$ of the binary variable $x_i$. The expected reward when using variable $x_i$ to select the best action is the sum of the expected rewards of the best actions for each of its possible values:

$\mu^i = \sum_{v \in \{0,1\}} P(v) \max_k \mu_k^i|v = \sum_{v \in \{0,1\}} \max_k \mu_{k,v}^i$

The optimal variable to be used for selecting the best action is: $i^* = \arg\max_{i \in \mathcal{V}} \mu^i$.

The algorithm VARIABLE SELECTION chooses the best variable. The Round-robin function sequentially explores the actions in $\mathcal{A}$ (see Algorithm 2, line 4). Each time $t_k$ the reward of the selected action $k$ is unveiled, the estimated expected rewards of the played action $k$ and observed values $\hat\mu_{k,v}^i$, and all the estimated rewards of variables $\hat\mu^i$, are updated (see VE function, lines 2-7). This parallel exploration strategy allows the algorithm to explore the variable set efficiently. When all the actions have been played once (see VE function, line 8), irrelevant variables are eliminated if:

$\hat\mu^{i'} - \hat\mu^{i} + \epsilon \geq 4 \sqrt{\frac{1}{2 t_k} \log \frac{4 K M t_k^2}{\delta}}$    (1)

where $i' = \arg\max_i \hat\mu^i$, and $t_k$ is the number of times the action $k$ has been played.
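As a sanity check of the definitions of $\mu^i$ and $i^*$, the toy problem of Table 1 can be recomputed with exact fractions. The sketch below is illustrative only: the dictionary layout and the function names are assumptions made for the example.

```python
from fractions import Fraction as F

# Table 1: for each variable, P(x_i = v) and the conditional mean reward
# mu_k^i|v of each action, listed for v = v0 and v = v1.
table = {
    "i1": {"p": [F(5, 8), F(3, 8)],
           "k1": [F(0), F(1)], "k2": [F(3, 5), F(1, 6)]},
    "i2": {"p": [F(3, 4), F(1, 4)],
           "k1": [F(1, 4), F(3, 4)], "k2": [F(9, 24), F(5, 8)]},
}

def action_mean(var, action):
    """Uncontextual mean mu_k = sum_v P(v) * mu_k^i|v."""
    row = table[var]
    return sum(p * m for p, m in zip(row["p"], row[action]))

def variable_value(var):
    """mu^i = sum_v max_k mu_{k,v}^i, with mu_{k,v}^i = P(v) * mu_k^i|v."""
    row = table[var]
    return sum(p * max(row["k1"][v], row["k2"][v])
               for v, p in enumerate(row["p"]))

# i* = argmax_i mu^i
best_variable = max(table, key=variable_value)
```

Here `action_mean("i1", "k2")` is the reward of Player 1, `variable_value("i1")` the reward of Player 2, and `variable_value("i2")` the reward of the third player.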


Algorithm 2 Variable Selection
Inputs: $\epsilon \in [0,1)$, $\delta \in (0,1]$
Output: an $\epsilon$-approximation of the best variable
1: $t = 0$, $\forall k\ t_k = 0$, $\forall (i,k,v)\ \hat\mu_{k,v}^i = 0$, $\forall i\ \hat\mu^i = 0$
2: repeat
3:     Receive the context vector $x_t$
4:     Play $k$ = Round-robin($\mathcal{A}$)
5:     Receive the reward $y_k(t)$
6:     $t_k = t_k + 1$
7:     $\mathcal{V}$ = VE($t_k, k, x_t, y_k, \mathcal{V}, \mathcal{A}$)
8:     $t = t + 1$
9: until $|\mathcal{V}| = 1$


1: Function VE($t, k, x_t, y_k, \mathcal{V}, \mathcal{A}$)
2: for each remaining variable $i \in \mathcal{V}$ do
3:     for each value $v$ do
4:         $\hat\mu_{k,v}^i = \frac{y_k}{t} \mathbb{1}_{x_i=v} + \frac{t-1}{t} \hat\mu_{k,v}^i$
5:     end for
6:     $\hat\mu^i = \sum_{v \in \{0,1\}} \max_k \hat\mu_{k,v}^i$
7: end for
8: if $k$ = LastAction($\mathcal{A}$) then
9:     Remove irrelevant variables from $\mathcal{V}$ according to equation 1 or 3
10: end if
11: return $\mathcal{V}$
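A minimal Python sketch of Algorithm 2 and its VE function is given below. It assumes that the elimination rule has the form $\hat\mu^{i'} - \hat\mu^{i} + \epsilon \geq 4\sqrt{\frac{1}{2t_k}\log\frac{4KMt_k^2}{\delta}}$; the names `variable_selection`, `toy_draw` and the `max_rounds` safeguard are assumptions made for the example, not the authors' implementation.

```python
import math
import random

def variable_selection(draw, n_vars, n_actions, delta=0.05, eps=0.0,
                       max_rounds=200_000):
    """Sketch of Algorithm 2: return the set of remaining variables."""
    remaining = set(range(n_vars))
    t_k = [0] * n_actions
    # mu[i][k][v] estimates mu_{k,v}^i = E[y_k * 1{x_i = v}]
    mu = [[[0.0, 0.0] for _ in range(n_actions)] for _ in range(n_vars)]
    for t in range(max_rounds):          # safeguard, not in the pseudocode
        if len(remaining) == 1:
            break
        x, y = draw()
        k = t % n_actions                # round-robin over the actions
        t_k[k] += 1
        n = t_k[k]
        for i in remaining:              # VE, lines 2-7: update the estimates
            for v in (0, 1):
                hit = 1.0 if x[i] == v else 0.0
                mu[i][k][v] = y[k] * hit / n + (n - 1) / n * mu[i][k][v]
        if k == n_actions - 1:           # VE, line 8: last action was played
            score = {i: sum(max(mu[i][a][v] for a in range(n_actions))
                            for v in (0, 1)) for i in remaining}
            best = max(score.values())
            radius = 4 * math.sqrt(
                math.log(4 * n_actions * n_vars * n * n / delta) / (2 * n))
            # keep a variable only while the elimination test fails
            remaining = {i for i in remaining
                         if best - score[i] + eps < radius}
    return remaining

def toy_draw(rng, n_vars=3):
    """Hypothetical environment: only x[0] is relevant to the rewards."""
    x = [rng.randint(0, 1) for _ in range(n_vars)]
    return x, [float(x[0]), float(1 - x[0])]
```

In the toy environment, variable 0 has $\mu^0 = 1$ while the two other variables have $\mu^i = 1/2$, so the sketch is expected to keep only variable 0.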


The parameter $\delta \in (0,1]$ corresponds to the probability of failure. The parameter $\epsilon$ is introduced for practical reasons: it is used to tune the convergence speed of the algorithm. In particular, when two variables provide the highest expected reward, the use of $\epsilon > 0$ ensures that the algorithm stops. The value of $\epsilon$ has to be in the same order of magnitude as the best mean reward we want to select. In the analysis of the algorithms, we will consider the case where $\epsilon = 0$. Lemma 1 analyzes the sample complexity (the number of iterations before stopping) of VARIABLE SELECTION.

Lemma 1: when $K \geq 2$, $M \geq 2$, and $\epsilon = 0$, the sample complexity of VARIABLE SELECTION needed to obtain $P(i' \neq i^*) \leq \delta$ is:

$t^* = \frac{64K}{\Delta_1^2} \log \frac{4KM}{\delta \Delta_1}$, where $\Delta_1 = \min_{i \neq i^*} (\mu^{i^*} - \mu^{i})$.

Lemma 1 shows that the dependence of the sample complexity needed to select the optimal variable on the number of variables is in $O(\log M)$. This means that VARIABLE SELECTION can be used to process large sets of contextual variables, and hence can be easily extended to categorical variables through a binary encoding, with only a logarithmic impact on the sample complexity. Finally, Lemma 2 shows that the variable selection algorithm is optimal up to logarithmic factors.
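To make the logarithmic dependence on $M$ concrete, the bound can be evaluated numerically. The sketch below assumes that the bound of Lemma 1 has the form $t^* = \frac{64K}{\Delta_1^2} \log \frac{4KM}{\delta \Delta_1}$; the function name and the parameter values are hypothetical.

```python
import math

def variable_selection_bound(K, M, gap, delta):
    """Evaluate 64K / gap^2 * log(4KM / (delta * gap))."""
    return 64 * K / gap ** 2 * math.log(4 * K * M / (delta * gap))

# Multiplying the number of variables by 100 ...
small = variable_selection_bound(K=10, M=100, gap=0.1, delta=0.05)
large = variable_selection_bound(K=10, M=10_000, gap=0.1, delta=0.05)
# ... only adds a term in log(100) to the bound.
ratio = large / small
```

With these values the bound grows by roughly a third when $M$ goes from 100 to 10,000.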


Lemma 2: There exists a distribution $D_{x,y}$ such that any algorithm finding the optimal variable $i^*$ has a sample complexity of at least:

$\Omega\left(\frac{K}{\Delta_1^2} \log \frac{1}{\delta}\right)$


3.3 Action selection 


To complete a decision stump, one needs to provide an algorithm which optimizes the choice of the best action knowing the selected variable. Any stochastic bandit algorithm such as UCB (see Auer et al (2002)), TS (see Thompson (1933); Kaufman et al (2012)) or BESA (see Baransi et al (2014)) can be used. For the consistency of the analysis, we choose SUCCESSIVE ELIMINATION in Even-Dar et al (2002, 2006) (see Algorithm 3), that we have renamed ACTION SELECTION. Let $\mu_k = E_{D_y}[y_k]$ be the expected reward of the action $k$ taken with respect to $D_y$. The estimated expected reward of the action is denoted by $\hat\mu_k$.


Algorithm 3 Action Selection
Inputs: $\epsilon \in [0,1)$, $\delta \in (0,1]$
Output: an $\epsilon$-approximation of the best arm
1: $t = 0$, $\forall k\ \hat\mu_k = 0$ and $t_k = 0$
2: repeat
3:     Play $k$ = Round-robin($\mathcal{A}$)
4:     Receive the reward $y_k(t)$
5:     $t_k = t_k + 1$
6:     $\mathcal{A}$ = AE($t_k, k, y_k(t), \mathcal{A}$)
7:     $t = t + 1$
8: until $|\mathcal{A}| = 1$


1: Function AE($t, k, x_t, y_k, \mathcal{A}$)
2: $\hat\mu_k = \frac{y_k}{t} + \frac{t-1}{t} \hat\mu_k$
3: if $k$ = LastAction($\mathcal{A}$) then
4:     Remove irrelevant actions from $\mathcal{A}$ according to equation 2 or 4
5: end if
6: return $\mathcal{A}$


The irrelevant actions in the set $\mathcal{A}$ are successively eliminated when:

$\hat\mu_{k'} - \hat\mu_{k} + \epsilon \geq 2 \sqrt{\frac{1}{2 t_k} \log \frac{4 K t_k^2}{\delta}}$    (2)

where $k' = \arg\max_k \hat\mu_k$, and $t_k$ is the number of times the action $k$ has been played.


Lemma 3: when $K \geq 2$, and $\epsilon = 0$, the sample complexity of ACTION SELECTION needed to obtain $P(k' \neq k^*) \leq \delta$ is:

$t^* = \frac{64K}{\Delta_2^2} \log \frac{4K}{\delta \Delta_2}$, where $\Delta_2 = \min_{k \neq k^*} (\mu_{k^*} - \mu_k)$.


The proof of Lemma 3 is the same as the one provided for Successive Elimination in Even-Dar et al (2006). Finally, Lemma 4 states that the action selection algorithm is optimal up to logarithmic factors (see Mannor and Tsitsiklis (2004), Theorem 1, for the proof).


Lemma 4: There exists a distribution $D_{x,y}$ such that any algorithm finding the optimal action $k^*$ has a sample complexity of at least:

$\Omega\left(\frac{K}{\Delta_2^2} \log \frac{1}{\delta}\right)$
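The ACTION SELECTION procedure of this section (Successive Elimination) can be sketched as follows. The confidence radius $2\sqrt{\frac{1}{2t_k}\log\frac{4Kt_k^2}{\delta}}$ used here is an assumption about the exact form of inequality (2), and the names `action_selection`, `pull` and `max_rounds` are hypothetical.

```python
import math
import random

def action_selection(pull, n_actions, delta=0.05, eps=0.0,
                     max_rounds=1_000_000):
    """Sketch of Algorithm 3: return the list of remaining actions."""
    remaining = list(range(n_actions))
    mu = [0.0] * n_actions      # estimated expected rewards
    t_k = [0] * n_actions       # number of plays of each action
    rounds = 0
    while len(remaining) > 1 and rounds < max_rounds:
        for k in remaining:     # one round-robin pass over remaining actions
            t_k[k] += 1
            # running mean, equal to y/t + (t-1)/t * mu
            mu[k] += (pull(k) - mu[k]) / t_k[k]
            rounds += 1
        n = t_k[remaining[-1]]  # all remaining actions share the same count
        radius = 2 * math.sqrt(
            math.log(4 * n_actions * n * n / delta) / (2 * n))
        best = max(mu[k] for k in remaining)
        # an action is eliminated when best - mu[k] + eps >= radius
        remaining = [k for k in remaining if best - mu[k] + eps < radius]
    return remaining
```

For instance, with three Bernoulli arms of means 0.2, 0.5 and 0.9, the sketch is expected to stop with only the last arm remaining.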


3.4 Analysis of a decision stump 


The decision stump uses the values of a contextual variable to choose the actions. The optimal decision stump uses the best variable to choose the best actions. It plays at time $t$: $k_t^* = \arg\max_k \mu_k^{i^*}|v$, where $i^* = \arg\max_i \mu^i$, and $v = x_{i^*}(t)$. The expected gain of the optimal policy is:

$E_{D_{x,y}}\left[y_{k_t^*}(t)\right] = \sum_v P(v) \mu_{k^*}^{i^*}|v = \sum_v \mu_{k^*,v}^{i^*} = \mu^{i^*}$


To select the best variable, one needs to find the best action for each value of each variable. In the DECISION STUMP algorithm (see Algorithm 4), an action selection task is allocated for each value of each contextual variable. When the reward is revealed, all the estimated rewards of variables $\hat\mu^i$ and the estimated rewards of the played action knowing the observed values of variables $\hat\mu_k^i|v$ are updated (respectively in the VE and AE functions): the variables and the actions are simultaneously explored. However, the elimination of actions becomes effective only when the best variable is selected. Indeed, if an action $k$ is eliminated for a value $v_0$ of a variable $i$, the estimation of $\mu_{k,v_1}^i$, the mean expected reward of the action $k$ for the value $v_1$, is biased. As a consequence, if an action is eliminated during the exploration of variables, the estimation of the mean reward $\mu^i$ can be biased. That is why the lower bound of the decision stump problem is the sum of the lower bounds of the variable and action selection problems (see Theorem 2). The only case where an action can be eliminated before the best variable is selected is when this action is eliminated for all values of all variables. For simplicity of the exposition of the DECISION STUMP algorithm, we did not handle this case here.


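Algorithm 4 Decision Stump
1: $t = 0$, $\forall k\ t_k = 0$, $\forall i\ \hat\mu^i = 0$, $\forall (i,v,k)\ t_{i,v,k} = 0$, $\mathcal{A}_{i,v} = \mathcal{A}$, $\hat\mu_{k,v}^i = 0$ and $\hat\mu_k^i|v = 0$
2: repeat
3:     Receive the context vector $x_t$
4:     if $|\mathcal{V}| > 1$ then Play $k$ = Round-robin($\mathcal{A}$)
5:     else Play $k$ = Round-robin($\mathcal{A}_{i,v}$)
6:     Receive the reward $y_k(t)$
7:     $t_k = t_k + 1$
8:     $\mathcal{V}$ = VE($t_k, k, x_t, y_k, \mathcal{V}, \mathcal{A}$)
9:     for each variable $i \in \mathcal{V}$ do
10:        $v = x_i(t)$, $t_{i,v,k} = t_{i,v,k} + 1$
11:        $\mathcal{A}_{i,v}$ = AE($t_{i,v,k}, k, x_t, y_k, \mathcal{A}_{i,v}$)
12:    end for
13:    $t = t + 1$
14: until $t = T$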


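Theorem 1: when $K \geq 2$, $M \geq 2$, and $\epsilon = 0$, the sample complexity needed by DECISION STUMP to obtain $P(i' \neq i^*) \leq \delta$ and $P(k' \neq k^*) \leq \delta$ is:

$t^* = \frac{64K}{\Delta_1^2} \log \frac{4KM}{\delta \Delta_1} + \frac{64K}{\Delta_2^2} \log \frac{4K}{\delta \Delta_2}$,

where $\Delta_1 = \min_{i \neq i^*} (\mu^{i^*} - \mu^{i})$, and $\Delta_2 = \min_{k \neq k^*, v \in \{0,1\}} (\mu_{k^*,v}^{i^*} - \mu_{k,v}^{i^*})$.

Theorem 2 provides a lower bound for the sample complexity, showing that the factors $1/\Delta_1^2$ and $1/\Delta_2^2$ are inherent to the decision stump problem. Notice that for the linear bandit problem, the same factor $1/\Delta^2$ was obtained in the lower bound (see Lemma 2 in Soare et al (2014)).

Theorem 2: There exists a distribution $D_{x,y}$ such that any algorithm finding the optimal decision stump has a sample complexity of at least:

$\Omega\left(\left[\frac{1}{\Delta_1^2} + \frac{1}{\Delta_2^2}\right] K \log \frac{1}{\delta}\right)$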
Remark 1: The factors in $\log 1/\Delta_1$, $\log 1/\Delta_2$ and $\log 1/\Delta$ could vanish in Theorem 1 following the same approach as that of Median Elimination in Even-Dar et al (2006). Despite the optimality of this algorithm, we did not choose it for the consistency of the analysis. Indeed, as it suppresses 1/4 of the variables (or actions) at the end of each elimination phase, Median Elimination is not well suited when a small number of variables (or actions) provide a lot of rewards and the others do not. In this case this algorithm spends a lot of time eliminating non-relevant variables. This case is precisely the one where we would like to use a local model such as a decision tree.

Remark 2: The extension of the analytical results to an $\epsilon$-approximation of the best decision stump is straightforward using $\Delta_\epsilon = \max(\epsilon, \Delta)$.


4 Bandit Forest 


A decision stump is a weak learner which uses only one variable to decide the best action. When one would like to combine $D$ variables to choose the best action, a tree structure of $\binom{M}{D} \cdot 2^D$ multi-armed bandit problems has to be allocated. To explore and exploit this tree structure with limited memory resources, our approach consists in combining greedy decision trees. When decision stumps are stacked in a greedy tree, they combine variables in a greedy way to choose the action. When a set of randomized trees vote, they combine variables in a non-greedy way to choose the action.

As highlighted by empirical studies (see for instance Fernandez-Delgado et al (2014)), the random forests of Breiman (2001) have emerged as serious competitors to state-of-the-art methods for classification tasks. In Biau (2012), the analysis of random forests shows that the reason for these good performances is that the convergence of the learning procedure is consistent, and that its rate of convergence depends only on the number of strong features, which is assumed to be low in comparison to the number of features. In this section, we propose to use a random forest built with the knowledge of $D_{x,y}$ as a reference for the proposed algorithm Bandit Forest.


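Algorithm 5 $\theta$-OGT($c_\theta^*$, $d_\theta$, $\mathcal{V}_{c_\theta^*}$)
Inputs: $\theta \in [0,1]$
Output: the $\theta$-optimal greedy tree when $\theta$-OGT((), 0, $\mathcal{V}$) is called
1: if $d_\theta < D_\theta$ then
2:     $\mathcal{S}_{c_\theta^*} = \{i \in \mathcal{V}_{c_\theta^*} : f(\theta, c_\theta^*, i) = 1\}$
3:     $i_{d_\theta+1}^* = \arg\max_{i \in \mathcal{S}_{c_\theta^*}} \mu^i|c_\theta^*$
4:     $\theta$-OGT($c_\theta^* + (x_{i_{d_\theta+1}^*} = 0)$, $d_\theta + 1$, $\mathcal{V}_{c_\theta^*} \setminus \{i_{d_\theta+1}^*\}$)
5:     $\theta$-OGT($c_\theta^* + (x_{i_{d_\theta+1}^*} = 1)$, $d_\theta + 1$, $\mathcal{V}_{c_\theta^*} \setminus \{i_{d_\theta+1}^*\}$)
6: else $k^*|c_\theta^* = \arg\max_k \mu_k|c_\theta^*$
7: end if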

Let $\Theta$ be a random variable, which is independent of $x$ and $y$. Let $\theta \in [0,1]$ be the value of $\Theta$. Let $D_\theta$ be the maximal depth of the tree $\theta$. Let $c_\theta = \{(x_{i_1} = v), ..., (x_{i_{d_\theta}} = v)\}$ be the index of a path of the tree $\theta$. We use $c_\theta(x_t)$ to denote the path $c_\theta$ selected at time $t$ by the tree $\theta$. Let $f(\theta, i, j) : [0,1] \times \{0,1\}^{D_\theta} \times M \to \{0,1\}$ be a function which parametrizes $\mathcal{V}_{c_\theta}$, the sets of available variables for each of the $2^{D_\theta}$ splits. We call $\theta$-optimal greedy tree the greedy tree built with the knowledge of $D_{x,y}$ and conditioned to the value $\theta$ of the random variable $\Theta$ (see Algorithm 5). We call optimal random forest of size $L$ a random forest which consists of a collection of $L$ $\theta$-optimal greedy trees. At each time step, the optimal random forest chooses the action $k_t^*$ which obtains the highest number of votes:

$k_t^* = \arg\max_k \sum_{i=1}^{L} \mathbb{1}_{k_{\theta_i,t}^* = k}$,

where $k_{\theta_i,t}^*$ is the action chosen by the optimal greedy tree $\theta_i$ at time $t$.
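The voting rule of the forest amounts to a majority vote over the actions proposed by the $L$ trees. A minimal sketch is given below; the function name `forest_vote` and the tie-breaking rule are assumptions, since the text does not specify how ties are resolved.

```python
from collections import Counter

def forest_vote(tree_actions):
    """Return the action receiving the most votes among the L trees."""
    votes = Counter(tree_actions)
    # Assumption: ties are broken in favour of the smallest action index.
    return min(votes, key=lambda k: (-votes[k], k))
```

For example, `forest_vote([2, 0, 2, 1, 2])` selects action 2, which is proposed by three of the five trees.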


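Algorithm 6 Bandit Forest
1: $t = 0$, $\forall \theta\ c_\theta = ()$ and $d_\theta = 0$, $\forall \theta$ NewPath($c_\theta$, $\mathcal{V}$)
2: repeat
3:     Receive the context vector $x_t$
4:     for each $\theta$ do select the context path $c_\theta(x_t)$
5:     if $d_\theta = D_\theta$ and $|\mathcal{S}_{c_\theta}| = 1$, and $\forall (\theta, i, v)\ |\mathcal{A}_{c_\theta,i,v}| = 1$ then $k = \arg\max_k \sum_{i=1}^{L} \mathbb{1}_{k_{\theta_i,t} = k}$
6:     else $k$ = Round-robin($\mathcal{A}$)
7:     end if
8:     Receive the reward $y_k(t)$
9:     for each $\theta$ do
10:        $t_{c_\theta,k} = t_{c_\theta,k} + 1$
11:        $\mathcal{S}_{c_\theta}$ = VE($t_{c_\theta,k}, k, x_t, y_k, \mathcal{S}_{c_\theta}, \mathcal{A}$)
12:        for each remaining variable $i$ do
13:            $v = x_i(t)$, $t_{c_\theta,i,v,k} = t_{c_\theta,i,v,k} + 1$
14:            $\mathcal{A}_{c_\theta,i,v}$ = AE($t_{c_\theta,i,v,k}, k, x_t, y_k, \mathcal{A}_{c_\theta,i,v}$)
15:        end for
16:        if $|\mathcal{S}_{c_\theta}| = 1$ and $d_\theta < D_\theta$ then
17:            $\mathcal{V}_{c_\theta} = \mathcal{V}_{c_\theta} \setminus \{i\}$
18:            NewPath($c_\theta + (x_i = 0)$, $\mathcal{V}_{c_\theta}$)
19:            NewPath($c_\theta + (x_i = 1)$, $\mathcal{V}_{c_\theta}$)
20:        end if
21:    end for
22:    $t = t + 1$
23: until $t = T$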

The Bandit Forest algorithm explores and exploits a set of $L$ decision trees knowing $\theta_1, ..., \theta_L$ (see Algorithm 6). When a context $x_t$ is received (line 3):

• For each tree $\theta$, the path $c_\theta$ is selected (line 4).

• An action $k$ is selected:

- If the learning of all paths $c_\theta$ is finished, then each path votes for its best action (line 5).

- Else the actions are sequentially played (line 6).

• The reward $y_k(t)$ is received (line 8).

• The decision stumps of each path $c_\theta$ are updated (lines 9-15).

• When the set of remaining variables of the decision stump corresponding to the path $c_\theta$ contains only one variable and the maximum depth $D_\theta$ is not reached (line 16), two new decision stumps corresponding to the values 0, 1 of the selected variable are allocated (lines 18-20). The random set of remaining variables $\mathcal{V}_{c_\theta}$, the counts, and the estimated means are initialized in function NewPath.

To take into account the $L$ decision trees of maximum depth $D_\theta$, irrelevant variables are eliminated using a slight modification of inequality (1). A possible next variable $x_i$ is eliminated when:

$\hat\mu^{i'}|c_\theta - \hat\mu^{i}|c_\theta + \epsilon \geq 4 \sqrt{\frac{1}{2 t_{c_\theta,k}} \log \frac{4 \times 2^D K M D_\theta L t_{c_\theta,k}^2}{\delta}}$    (3)

where $i' = \arg\max_{i \in \mathcal{V}_{c_\theta}} \hat\mu^i|c_\theta$, and $t_{c_\theta,k}$ is the number of times the path $c_\theta$ and the action $k$ have been observed. To take into account the $L$ decision trees, irrelevant actions are eliminated using a slight modification of inequality (2):

$\hat\mu_{k'}|c_\theta - \hat\mu_{k}|c_\theta + \epsilon \geq 2 \sqrt{\frac{1}{2 t_{c_\theta,i,v,k}} \log \frac{4 \times 2^D K L t_{c_\theta,i,v,k}^2}{\delta}}$    (4)

where $k' = \arg\max_k \hat\mu_k|c_\theta$, and $t_{c_\theta,i,v,k}$ is the number of times the action $k$ has been played when the path $c_\theta$ and the value $v$ of the variable $i$ have been observed.


1: Function NewPath($c_\theta$, $\mathcal{V}_{c_\theta}$)
2: $\mathcal{S}_{c_\theta} = \{i \in \mathcal{V}_{c_\theta} : f(\theta, c_\theta, i) = 1\}$
3: $d_\theta = d_\theta + 1$, $\forall (i,v)\ \mathcal{A}_{c_\theta,i,v} = \mathcal{A}$
4: $\forall k\ t_{c_\theta,k} = 0$, $\forall (i,v,k)\ t_{c_\theta,i,v,k} = 0$, $\forall i\ \hat\mu^i|c_\theta = 0$
5: $\forall (i,k,v)\ \hat\mu_{k,v}^i|c_\theta = 0$ and $\hat\mu_k^i|(v, c_\theta) = 0$


Theorem 3: when $K \geq 2$, $M \geq 2$, and $\epsilon = 0$, the sample complexity needed by BANDIT FOREST learning to obtain the optimal random forest of size $L$ with a probability at least $1 - \delta$ is:

$t^* = 2^D \left(\frac{64K}{\Delta_1^2} \log \frac{4KMDL}{\delta \Delta_1} + \frac{64K}{\Delta_2^2} \log \frac{4LK}{\delta \Delta_2}\right)$,

where $\Delta_1 = \min_{\theta, c_\theta^*, i \neq i^*} P(c_\theta^*) (\mu^{i^*}|c_\theta^* - \mu^{i}|c_\theta^*)$, $\Delta_2 = \min_{\theta, c_\theta^*, k \neq k^*} P(c_\theta^*) (\mu_{k^*}|c_\theta^* - \mu_{k}|c_\theta^*)$, and $D = \max_\theta D_\theta$.

The dependence of the sample complexity on the depth $D$ is exponential. This means that, like all decision trees, Bandit Forest is well suited for cases where there is a small subset of relevant variables belonging to a large set of variables ($D \ll M$). This usual restriction of local models is not a problem for a lot of applications, where one can build thousands of contextual variables, and where only a few of them are relevant.


Theorem 4: There exists a distribution $D_{x,y}$ such that any algorithm finding the optimal random forest of size $L$ has a sample complexity of at least:

$\Omega\left(2^D \left[\frac{1}{\Delta_1^2} + \frac{1}{\Delta_2^2}\right] K \log \frac{1}{\delta}\right)$

Theorem 4 shows that the Bandit Forest algorithm is near optimal. The result of this analysis is supported by empirical evidence in the next section.

Remark 4: We have chosen to analyze the Bandit Forest algorithm in the case of $\epsilon = 0$ in order to simplify the concept of the optimal policy. Another way is to define the set of $\epsilon$-optimal random forests, built with decision stumps which are optimal up to an $\epsilon$ approximation factor. In this case, the guarantees are given with respect to the worst policy of the set. When $\epsilon = 0$, this set contains only the optimal random forest of size $L$ given the values of $\Theta$.

5 Experimentation

In order to illustrate the theoretical analysis with reproducible results on large sets of contextual variables, we used three datasets from the UCI Machine Learning Repository (Forest Cover Type, Adult, and Census1990). We recoded each continuous variable using equal frequencies into 5 binary variables, and each categorical variable into disjunctive binary variables. We obtained 94, 82 and 255 binary variables for the Forest Cover Type, Adult and Census1990 datasets respectively. For Forest Cover Type, we used the 7 target classes as the set of actions. For Adult, the categorical variable occupation is used as a set of 14 actions. For Census1990, the categorical variable Yearsch is used as a set of 18 actions. The gain of policies was evaluated using the class labels of the dataset, with a reward of 1 when the chosen action corresponds to the class label and 0 otherwise. The datasets, respectively composed of 581000, 48840 and 2458285 instances, were shuffled and played in a loop to simulate streams. In order to introduce noise between loops, at each time step the value of each binary variable has a probability of 0.05 to be inverted. Hence, we can consider that each context-reward vector is generated by a stationary random process. We set the time horizon to 10 million iterations. The algorithms are assessed in terms of accumulated regret against the optimal random forest of size 100. The optimal random forest is obtained by training a random forest of size 100, not limited by depth, on the whole dataset with full information feedback and without noise.


Figure 1: The accumulated regret of Bandit Forest and 
Bandit Tree with different depths against the optimal 
policy averaged over ten trials on Forest Cover Type dataset 
played in a loop with noisy inputs. 
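The stream simulation used in the experiments (a shuffled dataset played in a loop, each binary variable inverted with probability 0.05, and a 0/1 reward on the class label) can be sketched as follows. This is only an illustration under stated assumptions: the names `noisy_stream` and `reward` are hypothetical, and the dataset is assumed here to be shuffled once before looping.

```python
import random

def noisy_stream(contexts, labels, horizon, flip_prob=0.05, rng=random):
    """Yield (noisy binary context, label) pairs by looping over a dataset."""
    order = list(range(len(contexts)))
    rng.shuffle(order)                       # assumption: shuffled once
    for t in range(horizon):
        j = order[t % len(order)]            # play the dataset in a loop
        # each binary variable is inverted with probability flip_prob
        x = [b ^ 1 if rng.random() < flip_prob else b for b in contexts[j]]
        yield x, labels[j]

def reward(action, label):
    """Reward 1 when the chosen action matches the class label, else 0."""
    return 1 if action == label else 0
```

Because the noise is redrawn at every time step, two passes over the same instance generally produce different context vectors.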


To effectively implement Bandit Forest, we have made two modifications to the analyzed algorithms. Firstly, the Round-robin function is replaced by a uniform random draw from the union of the remaining actions of the paths. Secondly, the rewards are normalized using Inverse Propensity Scoring (see Horvitz and Thompson (1952)): the obtained reward is divided by the probability to draw the played action. First of all, notice that the regret curves of the Bandit Forest algorithm are far from those of explore then exploit approaches: the regret is gradually reduced over time (see Figure 1). Indeed, the Bandit Forest algorithm uses a localized explore then exploit approach: most of the paths, which are unlikely, may remain in exploration state, while the most frequent ones already vote for their best actions. The behavior of the algorithm with regard to the number of trees $L$ is simple: the higher $L$, the better the performance, and the higher the computational cost (see Figure 3). To analyze the sensitivity to depth of the Bandit Forest algorithm, we set the maximum depth and we compared the performances of a single tree without randomization (Bandit Tree) and of Bandit Forest with $L = 100$. The trees of the forest are randomized at each node with different values of $\epsilon$ (between 0.4 and 0.8), and with different sets of available splitting variables (random subset of 80% of the remaining variables). For each tested depth, a significant improvement is observed thanks to the vote of randomized trees (see Figure 1). Moreover, the Bandit Forest algorithm appears to be less sensitive to depth than a single tree: the higher the difference in depth, the higher the difference in performance.
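The two implementation modifications described above (uniform random draw instead of Round-robin, and Inverse Propensity Scoring of the reward) can be sketched as follows. The function name `play_with_ips` and the `observe` callback are assumptions made for the example, not the authors' code.

```python
import random

def play_with_ips(remaining_actions, observe, rng=random):
    """Draw an action uniformly; return (action, IPS-normalized reward)."""
    actions = sorted(remaining_actions)
    k = rng.choice(actions)
    propensity = 1.0 / len(actions)   # probability of drawing action k
    # the obtained reward is divided by the probability of the played action
    return k, observe(k) / propensity
```

With four remaining actions and an observed reward of 1, the normalized reward is 4; with a single remaining action the reward is left unchanged.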

To compare with the state of the art, the trees in the forest are randomized with different values of the parameter $D$ (between 10 and 18). On the three datasets, Banditron (Kakade et al (2008)) is clearly outperformed by the other tested algorithms (see Figures 2, 4 and 5). Neural Bandit (Allesiardo et al (2014)) is a committee of Multi-Layer Perceptrons (MLP). Due to the non-convexity of the error function, an MLP trained with the back-propagation algorithm (Rumelhart et al (1986)) does not find the optimal solution. That is why finding the regret bound of such a model is still an open problem. However, since MLPs are universal approximators (Hornik et al (1989)), Neural Bandit is a very strong baseline. It outperforms LinUCB (Li et al (2010)) on Forest Cover Type and on Adult. Bandit Forest clearly outperforms LinUCB and Banditron on the three datasets. In comparison to Neural Bandit, Bandit Forest obtains better results on Census1990 and Adult, and it is outperformed on Forest Cover Type, where the number of strong features seems to be high. Finally, as shown by the worst case analysis, the risk of using Bandit Forest on a lot of optimization problems is controlled, while Neural Bandit, which has no theoretical guarantee, can obtain poor performances on some problems, such as Census1990, where it is outperformed by the linear solution LinUCB. This uncontrolled risk increases with the number of actions, since the probability of obtaining a non-robust MLP increases linearly with the number of actions.

Figure 2: The accumulated regret against the optimal policy averaged over ten trials for the dataset Forest Cover Type played in a loop with noisy inputs.

Figure 3: Sensitivity to the size of Bandit Forest on the Census1990 dataset.

Figure 4: The accumulated regret against the optimal policy averaged over ten trials for the dataset Adult played in a loop with noisy inputs.

Figure 5: The accumulated regret against the optimal policy averaged over ten trials for the dataset Census1990 played in a loop with noisy inputs.


6 Conclusion 

We have shown that the proposed algorithm is optimal up to logarithmic factors with respect to a strong reference: a random forest built knowing the joint distribution of the contexts and rewards. In the experiments, Bandit Forest clearly outperforms LinUCB, which is a strong baseline with a known regret bound, and performs as well as Neural Bandit, for which we do not have a theoretical guarantee. Finally, for applications where the number of strong features is low in comparison to the number of possible features, which is often the case, Bandit Forest shows valuable properties:

• its sample complexities have a logarithmic dependence on the number of contextual variables, which means that it can process a large number of contextual variables with a low impact on regret,

• its low computational cost allows it to process infinite data streams efficiently,

• like all decision tree algorithms, it is well suited to deal with non-linear dependencies between contexts and rewards.

7 Appendix

7.1 Notations 

To facilitate the reading of the paper, we provide below (see Table 2) the list of notations.


Table 2: Notations

notation         description

$K$              number of actions
$M$              number of contextual variables
$D_\theta$       maximum depth of the tree $\theta$
$L$              number of trees
$T$              time horizon
$A$              set of actions
$V$              set of variables
$S$              set of remaining variables
$x$              context vector, $x = (x_1, \dots, x_M)$
$y$              reward vector, $y = (y_1, \dots, y_K)$
$k_t$            action chosen at time $t$
$c_\theta$       context path of the tree $\theta$
$d_\theta$       current depth of the context path $c_\theta$
$\mu_k$          expected reward of action $k$, $\mu_k = \mathbb{E}_{D_y}[y_k]$
$\mu^i_{k|v}$    expected reward of action $k$ conditioned on $x_i = v$
$\mu^i_{k,v}$    expected reward of action $k$ and $x_i = v$, $\mu^i_{k,v} = \mathbb{E}_{D_{x,y}}[y_k \cdot \mathbb{1}_{x_i = v}]$
$\mu^i$          expected reward for the use of the variable $x_i$ to select the best actions
$\delta$         probability of error
$\epsilon$       approximation error
$\Delta_1$       minimum difference between the expected reward of the best action and the expected reward of a given action $k$: $\Delta_1 = \min_{k \ne k^*} (\mu_{k^*} - \mu_k)$
$\Delta_2$       minimum difference between the expected reward of the best variable $\mu^{i^*}$ and the expected reward of the other variables: $\Delta_2 = \min_{i \ne i^*} (\mu^{i^*} - \mu^i)$
$t^*$            sample complexity of the decision stump


7.2 Lemma 1 


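Proof. We cannot directly use Hoeffding's inequality (see Hoeffding (1963)) to bound the estimated gains of the use of variables. The proof of Lemma 1 overcomes this difficulty by bounding each estimated gain $\hat{\mu}^i$ by the sum of the bounds over the values of the expected reward of the best action $\mu^i_{k^*,v}$ (inequality 7). From Hoeffding's inequality, at time $t$ we have:

$$\mathbb{P}\left(|\hat{\mu}^i_{k,v} - \mu^i_{k,v}| \ge \alpha_{t_k}\right) \le 2\exp\left(-2\alpha_{t_k}^2 t_k\right) = \frac{2\delta}{4KMt_k^2} = \frac{\delta}{2KMt_k^2},$$

where $\alpha_{t_k} = \sqrt{\frac{1}{2t_k}\log\frac{4KMt_k^2}{\delta}}$.

Using Hoeffding's inequality at each time $t_k$, applying the union bound and then $\sum 1/t_k^2 = \pi^2/6$, the following inequality holds for any time $t$ with a probability $1 - \frac{\delta\pi^2}{12KM}$:

$$\hat{\mu}^i_{k,v} - \alpha_{t_k} \le \mu^i_{k,v} \le \hat{\mu}^i_{k,v} + \alpha_{t_k} \tag{5}$$

If the inequality (5) holds for the actions $k' = \arg\max_k \hat{\mu}^i_{k,v}$ and $k^* = \arg\max_k \mu^i_{k,v}$, we have:

$$\hat{\mu}^i_{k',v} - \alpha_{t_k} \le \mu^i_{k',v} \le \mu^i_{k^*,v} \le \hat{\mu}^i_{k^*,v} + \alpha_{t_k} \le \hat{\mu}^i_{k',v} + \alpha_{t_k}$$

$$\Rightarrow \hat{\mu}^i_{k',v} - \alpha_{t_k} \le \mu^i_{k^*,v} \le \hat{\mu}^i_{k',v} + \alpha_{t_k} \tag{6}$$

If the previous inequality (6) holds for all values $v$ of the variable $x_i$, we have:

$$\sum_{v \in \{0,1\}} \left(\hat{\mu}^i_{k',v} - \alpha_{t_k}\right) \le \sum_{v \in \{0,1\}} \mu^i_{k^*,v} \le \sum_{v \in \{0,1\}} \left(\hat{\mu}^i_{k',v} + \alpha_{t_k}\right)$$

$$\Leftrightarrow \hat{\mu}^i - 2\alpha_{t_k} \le \mu^i \le \hat{\mu}^i + 2\alpha_{t_k} \tag{7}$$

If the previous inequality holds for $i' = \arg\max_i \hat{\mu}^i$, we have:

$$\hat{\mu}^{i'} - 2\alpha_{t_k} \le \mu^{i'} \le \mu^{i^*} \tag{8}$$

As a consequence, the variable $x_i$ cannot be the best one when:

$$\hat{\mu}^i + 2\alpha_{t_k} \le \hat{\mu}^{i'} - 2\alpha_{t_k} \tag{9}$$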
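As a rough numerical illustration of the confidence radius used in inequalities (5) to (9), the following sketch evaluates $\alpha_{t_k}$ and the corresponding Hoeffding tail. The function names are hypothetical and chosen only for this example; the formulas are the ones stated in the proof.

```python
import math


def confidence_radius(t_k: int, K: int, M: int, delta: float) -> float:
    """alpha_{t_k} = sqrt(log(4*K*M*t_k^2 / delta) / (2*t_k))."""
    return math.sqrt(math.log(4 * K * M * t_k ** 2 / delta) / (2 * t_k))


def hoeffding_tail(t_k: int, K: int, M: int, delta: float) -> float:
    """Two-sided Hoeffding tail 2*exp(-2*alpha^2*t_k) at radius alpha_{t_k}."""
    alpha = confidence_radius(t_k, K, M, delta)
    return 2.0 * math.exp(-2.0 * alpha ** 2 * t_k)


def anytime_failure(K: int, M: int, delta: float, horizon: int) -> float:
    """Union bound over t_k = 1..horizon of the per-step tails."""
    return sum(hoeffding_tail(t, K, M, delta) for t in range(1, horizon + 1))


if __name__ == "__main__":
    K, M, delta = 5, 20, 0.05
    print(confidence_radius(100, K, M, delta))
    print(anytime_failure(K, M, delta, 10_000), delta * math.pi ** 2 / (12 * K * M))
```

The per-step tail equals $\delta/(2KMt_k^2)$, so the partial sums stay below $\delta\pi^2/(12KM)$ whatever the horizon.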


Using the union bound, the probability of making an error about the selection of the next variable by eliminating each variable $x_i$ when the inequality (9) holds is bounded by the sum, for each variable $x_i$ and each value $v$, of the probabilities that the inequality (5) does not hold for $k'$ and $k^*$:

$$\mathbb{P}(i^* \ne i') \le \sum_{i \in V} \frac{K\delta\pi^2}{24KM} \le \delta \tag{10}$$


Now, we have to consider the number of steps needed to select the optimal variable. If the best variable has not been eliminated (probability $1 - \delta$), the last variable $x_i$ is eliminated when:

$$\hat{\mu}^{i^*} - \hat{\mu}^i \ge 4\alpha_{t_k^*} \tag{11}$$

The difference between the expected reward of a variable $x_i$ and the best next variable is defined by:

$$\Delta_i = \mu^{i^*} - \mu^i$$

Assume that:

$$\Delta_i \ge 4\alpha_{t_k}$$

The following inequality holds for the variable $x_i$ with a probability $1 - \delta$:

$$\hat{\mu}^i - 2\alpha_{t_k} \le \mu^i \le \hat{\mu}^i + 2\alpha_{t_k}$$

Then, using the previous inequality in the inequality (11), we obtain:

$$(\hat{\mu}^{i^*} + 2\alpha_{t_k}) - (\hat{\mu}^i + 2\alpha_{t_k}) \ge \mu^{i^*} - \mu^i$$

Hence, we have:

$$\hat{\mu}^{i^*} - \hat{\mu}^i \ge 4\alpha_{t_k}$$

The condition $\Delta_i \ge 4\alpha_{t_k}$ implies the elimination of the variable $x_i$. Then, we have:

$$\Delta_i \ge 4\alpha_{t_k} \;\Rightarrow\; \Delta_i^2 \ge \frac{8}{t_k}\log\frac{4KMt_k^2}{\delta} \tag{12}$$


The time $t_k^*$, where all non-optimal variables have been eliminated, is reached when the variable corresponding to the minimum of $\Delta_i^2$ is eliminated:

$$\Delta^2 \ge \frac{8}{t_k^*}\log\frac{4KM{t_k^*}^2}{\delta}, \quad \text{where } \Delta = \min_{i \ne i^*} \Delta_i. \tag{13}$$

The inequality (13) holds for all variables $i$ with a probability $1 - \delta$ for:

$$t_k^* = \frac{64}{\Delta^2}\log\frac{4KM}{\delta\Delta} \tag{14}$$

Indeed, if we replace $t_k$ by the value of $t_k^*$ in the right term of the inequality (12), we obtain:


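$$\frac{8}{t_k^*}\log\frac{4KM{t_k^*}^2}{\delta} = \frac{\Delta^2}{8\log\frac{4KM}{\delta\Delta}} \log\left(\frac{4KM}{\delta} \cdot \frac{64^2}{\Delta^4} \cdot \log^2\frac{4KM}{\delta\Delta}\right)$$

$$= \frac{\Delta^2}{8\log\frac{4KM}{\delta\Delta}} \left(\log\frac{4KM}{\delta} - 4\log\Delta + 12\log 2 + 2\log\log\frac{4KM}{\delta\Delta}\right)$$

For $x \ge 13$, we have:

$$12\log 2 + 2\log\log x < 4\log x$$

Hence, for $\frac{4KM}{\delta\Delta} \ge 13$, we have:

$$\frac{\Delta^2}{8\log\frac{4KM}{\delta\Delta}} \left(\log\frac{4KM}{\delta} - 4\log\Delta + 12\log 2 + 2\log\log\frac{4KM}{\delta\Delta}\right) < \frac{\Delta^2}{8\log\frac{4KM}{\delta\Delta}} \left(\log\frac{4KM}{\delta} - 4\log\Delta + 4\log\frac{4KM}{\delta\Delta}\right)$$

$$\le \frac{\Delta^2}{8\log\frac{4KM}{\delta\Delta}} \cdot 8\log\frac{4KM}{\delta\Delta} = \Delta^2$$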
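The two steps of this verification can be checked numerically. The sketch below, with hypothetical function names and arbitrary example values for $K$, $M$, $\delta$ and $\Delta$, tests the elementary inequality for $x \ge 13$ and confirms that plugging $t_k^*$ into the right term of (12) gives a value below $\Delta^2$.

```python
import math


def stopping_time(K: int, M: int, delta: float, gap: float) -> float:
    """t_k^* = (64 / gap^2) * log(4*K*M / (delta*gap)), as in equation (14)."""
    return 64.0 / gap ** 2 * math.log(4.0 * K * M / (delta * gap))


def right_term_of_12(t_k: float, K: int, M: int, delta: float) -> float:
    """(8 / t_k) * log(4*K*M*t_k^2 / delta), the right term of inequality (12)."""
    return 8.0 / t_k * math.log(4.0 * K * M * t_k ** 2 / delta)


def elementary_inequality_holds(x: float) -> bool:
    """Checks 12*log(2) + 2*log(log(x)) < 4*log(x)."""
    return 12.0 * math.log(2.0) + 2.0 * math.log(math.log(x)) < 4.0 * math.log(x)


if __name__ == "__main__":
    K, M, delta, gap = 5, 20, 0.05, 0.1  # arbitrary example values
    t_star = stopping_time(K, M, delta, gap)
    print(t_star, right_term_of_12(t_star, K, M, delta), gap ** 2)
```

With these example values, $4KM/(\delta\Delta) = 80000 \ge 13$, so the condition of the proof is satisfied.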


Hence, we obtain:

$$t_k^* = \frac{64}{\Delta^2}\log\frac{4KM}{\delta\Delta}, \text{ with a probability } 1 - \delta$$

As the actions are chosen the same number of times (Round-robin function), we have $t = K t_k$, and thus:

$$t^* = \frac{64K}{\Delta^2}\log\frac{4KM}{\delta\Delta}, \text{ with a probability } 1 - \delta$$


□ 


7.3 Lemma 2 

Proof. Let $y^i_{k,v}$ be a bounded random variable corresponding to the reward of the action $k$ when the value $v$ of the variable $i$ is observed. Let $y^i$ be a random variable such that:
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We have; 


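Each $y^i$ is updated at each step $t_k$ when each action has been played once. Let $\Theta$ be the sum of the binary random variables $\theta_1, \dots, \theta_{t_k}, \dots, \theta_{t_k^*}$ such that $\theta_{t_k} = \mathbb{1}_{y^i(t_k) \ge y^j(t_k)}$. Let $p_{ij}$ be the probability that the use of variable $i$ leads to more rewards than the use of variable $j$. We have:

$$p_{ij} = \frac{1}{2} - \Delta_{ij}, \quad \text{where } \Delta_{ij} = \mu^i - \mu^j.$$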


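Slud's inequality (see Slud (1977)) states that when $p \le 1/2$ and $t_k^* p \le x \le t_k^*(1 - p)$, we have:

$$\mathbb{P}(\Theta \ge x) \ge \mathbb{P}\left(Z \ge \frac{x - t_k^* p}{\sqrt{t_k^* p(1 - p)}}\right) \tag{15}$$

where $Z$ is a normal $\mathcal{N}(0,1)$ random variable.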
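Slud's inequality can be checked numerically on a small case. The sketch below compares the exact upper tail of a binomial variable with the normal tail on the right-hand side of (15). The function names and the values $n = 100$, $p = 0.4$ are assumptions made for the example, with $n$ playing the role of $t_k^*$.

```python
import math


def binomial_upper_tail(n: int, p: float, x: int) -> float:
    """Exact P(Theta >= x) for Theta ~ Binomial(n, p)."""
    return sum(math.comb(n, j) * p ** j * (1.0 - p) ** (n - j) for j in range(x, n + 1))


def normal_upper_tail(z: float) -> float:
    """P(Z >= z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def slud_lower_bound(n: int, p: float, x: int) -> float:
    """Right-hand side of Slud's inequality for n*p <= x <= n*(1 - p), p <= 1/2."""
    z = (x - n * p) / math.sqrt(n * p * (1.0 - p))
    return normal_upper_tail(z)


if __name__ == "__main__":
    n, p = 100, 0.4
    for x in (40, 50, 60):
        print(x, binomial_upper_tail(n, p, x), slud_lower_bound(n, p, x))
```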


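To choose the best variable between $i$ and $j$, one needs to find the time $t_k^*$ where $\mathbb{P}(\Theta \ge t_k^*/2) \ge \delta$. To state the number of trials $t_k^*$ needed to estimate $\Delta_{ij}$, we recall and adapt the arguments developed in Mousavi (2010). Using Slud's inequality (see equation 15), we have:

$$\mathbb{P}(\Theta \ge t_k^*/2) \ge \mathbb{P}\left(Z \ge \frac{t_k^* \Delta_{ij}}{\sqrt{t_k^* p_{ij}(1 - p_{ij})}}\right) \tag{16}$$

Then, we use the lower bound of the error function (see Chu (1955)):

$$\mathbb{P}(Z \ge z) \ge 1 - \sqrt{1 - \exp\left(-\frac{z^2}{2}\right)}$$

Therefore, we have:

$$\mathbb{P}(\Theta \ge t_k^*/2) \ge 1 - \sqrt{1 - \exp\left(-\frac{t_k^* \Delta_{ij}^2}{2p_{ij}(1 - p_{ij})}\right)} \ge 1 - \sqrt{1 - \exp\left(-\frac{t_k^* \Delta_{ij}^2}{p_{ij}}\right)} \ge \frac{1}{2}\exp\left(-\frac{t_k^* \Delta_{ij}^2}{p_{ij}}\right)$$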

As $p_{ij} = 1/2 - \Delta_{ij}$, we have:

$$t_k^* = \Omega\left(\frac{1}{\Delta_{ij}^2}\log\frac{1}{\delta}\right)$$

Then, we need to use the fact that, as all the values of all the variables are observed when one action is played, the $M(M-1)/2$ estimations of bias are solved in parallel. In the worst case, $\min_{ij} \Delta_{ij} = \min_j \Delta_{i^*j} = \Delta$. Thus any algorithm needs at least a sample complexity $t^*$, where:

$$t^* = K \cdot t_k^* = \Omega\left(\frac{K}{\Delta^2}\log\frac{1}{\delta}\right)$$




□ 


7.4 Theorem 1 

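Proof. Lemma 1 states that the sample complexity needed to find the best variable is:

$$t_1^* = \frac{64K}{\Delta_1^2}\log\frac{4KM}{\delta\Delta_1}$$

Lemma 3 states that the sample complexity needed to find the optimal action for a value $v$ of the best variable is:

$$t_{2,v}^* = \frac{64K}{\Delta_{2,v}^2}\log\frac{4K}{\delta\Delta_{2,v}},$$

where $\Delta_{2,v}$ is the minimum difference between the expected reward of the best action and that of any other action when the best variable takes the value $v$.

The sample complexity of the decision stump algorithm is bounded by the sum of the sample complexities of the variable selection and action elimination algorithms:

$$t^* = t_1^* + t_2^*, \quad \text{where } t_2^* = \max_v t_{2,v}^*$$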


□ 
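To make the logarithmic dependence on the number of contextual variables concrete, the decision stump bound can be evaluated numerically. This is a minimal sketch: the function names, the argument names and the example values are hypothetical, and the formulas are the two terms recalled in the proof of Theorem 1.

```python
import math


def variable_selection_bound(K: int, M: int, delta: float, gap_var: float) -> float:
    """(64*K / gap^2) * log(4*K*M / (delta*gap)): bound recalled from Lemma 1."""
    return 64.0 * K / gap_var ** 2 * math.log(4.0 * K * M / (delta * gap_var))


def action_elimination_bound(K: int, delta: float, gap_act: float) -> float:
    """(64*K / gap^2) * log(4*K / (delta*gap)): bound recalled from Lemma 3."""
    return 64.0 * K / gap_act ** 2 * math.log(4.0 * K / (delta * gap_act))


def decision_stump_bound(K, M, delta, gap_var, gaps_act_by_value):
    """Sum of the variable selection term and the worst action elimination term."""
    t1 = variable_selection_bound(K, M, delta, gap_var)
    t2 = max(action_elimination_bound(K, delta, g) for g in gaps_act_by_value)
    return t1 + t2


if __name__ == "__main__":
    for M in (10 ** 2, 10 ** 4, 10 ** 6):
        print(M, decision_stump_bound(5, M, 0.05, 0.1, [0.2, 0.1]))
```

Multiplying the number of variables by 100 only adds a constant amount to the variable selection term, which is the behaviour claimed for Bandit Forest on large sets of contextual variables.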


7.5 Theorem 2 

Proof. In the worst case, all the values of the variables have different best actions, and K = 2M. If an action is suppressed before the best variable is selected, the mean reward of one variable is underestimated. In the worst case, this variable is the best one, and a sub-optimal variable is selected. Thus, the best variable has to be selected before any action is eliminated. The lower bound of the decision stump problem is the sum of the variable selection and best arm identification lower bounds, stated respectively in Lemma 2 and Lemma 4.

□ 


7.6 Theorem 3 

Proof. The proof of Theorem 3 uses Lemma 1 and Lemma 3. Using the slight modification of the variable elimination inequality proposed in section 3, Lemma 1 states that for each decision stump, we have:

$$\mathbb{P}\left(i^*_{d_\theta}|c_\theta \ne i'_{d_\theta}|c_\theta\right) \le \frac{\delta}{2^{D_\theta} L}$$

For the action corresponding to the path $c_\theta$, Lemma 3 states that:

$$\mathbb{P}(k \ne k^*) \le \frac{\delta}{2^{D_\theta} L}$$

From the union bound, we have:

$$\mathbb{P}(\exists c_\theta \text{ such that } c_\theta \ne c^*_\theta) \le \delta \quad \text{and} \quad \mathbb{P}(\exists k \text{ such that } k \ne k^*) \le \delta$$

Using Lemma 1 and Lemma 3, and summing the sample complexities of the $2^{D_\theta}$ variable selection tasks and of the $2^{D_\theta}$ action selection tasks, we bound the sample complexity of any tree $\theta$ by:

$$t^* \le 2^D \left[\frac{64K}{\Delta_1^2}\log\frac{4KMDL}{\delta\Delta_1} + \frac{64K}{\Delta_2^2}\log\frac{4KL}{\delta\Delta_2}\right],$$

where $D = \max_\theta D_\theta$.


□ 
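The bound of Theorem 3 can be evaluated in the same way to see how it scales with the depth. This sketch uses a hypothetical function name and arbitrary example values, and simply transcribes the bound as stated above.

```python
import math


def tree_sample_complexity_bound(K: int, M: int, D: int, L: int, delta: float,
                                 gap_1: float, gap_2: float) -> float:
    """2^D * [64K/gap_1^2 * log(4KMDL/(delta*gap_1)) + 64K/gap_2^2 * log(4KL/(delta*gap_2))]."""
    variable_term = 64.0 * K / gap_1 ** 2 * math.log(4.0 * K * M * D * L / (delta * gap_1))
    action_term = 64.0 * K / gap_2 ** 2 * math.log(4.0 * K * L / (delta * gap_2))
    return 2 ** D * (variable_term + action_term)


if __name__ == "__main__":
    for depth in (10, 11, 12):
        print(depth, tree_sample_complexity_bound(5, 100, depth, 100, 0.05, 0.1, 0.1))
```

Each additional level of depth roughly doubles the bound, while the number of trees $L$ and the number of variables $M$ only enter through the logarithms.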


7.7 Theorem 4 


Proof. To build a decision tree of depth $D_\theta$, any greedy algorithm needs to solve $\sum_{d < D_\theta} 2^d$ variable selection problems (one per node), and $2^{D_\theta}$ action selection problems (one per leaf). Then, using Lemma 2 and Lemma 4, any greedy algorithm needs a sample complexity of at least:

$$t^* \ge \Omega\left(2^D \left[\frac{1}{\Delta_1^2} + \frac{1}{\Delta_2^2}\right] K \log\frac{1}{\delta}\right)$$


□ 
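The counting argument behind this lower bound is easy to check. The helper names in the sketch below are hypothetical, and the tree is assumed to be a complete binary tree of the given depth.

```python
def num_variable_selection_problems(depth: int) -> int:
    """One variable selection problem per internal node: sum of 2^d for d < depth."""
    return sum(2 ** d for d in range(depth))


def num_action_selection_problems(depth: int) -> int:
    """One action selection problem per leaf: 2^depth."""
    return 2 ** depth


if __name__ == "__main__":
    for depth in (1, 2, 10, 18):
        print(depth, num_variable_selection_problems(depth), num_action_selection_problems(depth))
```

Both counts are of order $2^D$, which is where the $2^D$ factor in the bound comes from.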


7.8 Additional experimental results 

We provide below (see Table 3) the classification rates, to compare the asymptotic performance of each algorithm, and the processing times.


Table 3: Summary of results on the datasets played in a loop. The regret against the optimal random forest is evaluated on ten trials. Each trial corresponds to a random starting point in the dataset. The confidence interval is given with a probability of 95%. The classification rate is evaluated on the last 100000 contexts. The mean running time was evaluated on a simple computer with a quad-core processor and 6 GB of RAM.